This analysis looks into building a model to understand \(\mathbb{P}(\text{success})\).
We first need to get the data from MLflow. This analysis details the steps required to go from loading the data to building various models. We also conduct an EDA on the instance features.
# Import libraries
library(here)
library(igraph)
library(cowplot)
library(grid)
library(gridExtra)
library(tidyverse)
library(GGally)
library(RColorBrewer)
library(corrplot)
# Set knitr options
knitr::opts_chunk$set(echo = TRUE, warning = F, message = F, error = F, fig.align = 'center', fig.width = 10)
# Source in relevant scripts
source(here("utils/mlflow-utils.R"))
source(here("utils/plotting-utils.R"))
source(here("analysis/enrichments/variable-clause-graph.R"))
source(here("analysis/enrichments/variable-graph.R"))
# Load in enriched data
d_enriched <- read_rds(here("data/d_enriched.rds"))
d_runs <- get_mlflow_data(here("data/d_runs.csv"))
## 2020-05-28 12:27:36 INFO::Data last updated 2020-05-27 23:13:37
Our enriched dataset has a naming convention based on the prefix of each column; we describe the numeric columns below:
- metrics_<METRIC>: with the exception of metrics_clause_var_ratio (clause-to-variable ratio), these are metrics calculated during our simulation and cannot be calculated prior to evolving our system.
- params_<PARAMETER>: with the exception of params_instance_index (index for the randomly generated problem), these are input parameters to our system's evolution.
- f_p_size_<FEATURE>: features related to the problem size of our instance.
- f_vcg_<FEATURE>: features related to the variable-clause graph of our instance.
- f_vg_<FEATURE>: features related to the variable graph of our instance.
The Variable-Clause Graph (VCG) is a bipartite graph with a node for each variable, a node for each clause, and an edge between them whenever a variable occurs in a clause.
example_d_clauses = list(
k_1 = c(1L, 3L, 4L),
k_2 = c(1L, 5L, 6L),
k_3 = c(1L, 4L, 6L),
k_4 = c(2L, 5L, 6L),
k_5 = c(1L, 2L, 5L)
)
colrs = c("tomato", "gold")
example_qubits = 6
example_var_clause_graph = make_variable_clause_graph(d_clauses = example_d_clauses, example_qubits, ret.all = T)
ggnet2(example_var_clause_graph$vcg, color = "color", palette = "Set2") +
scale_color_brewer(
"",
palette = "Set1",
labels = c("tomato" = "Variable", "gold" = "Clause"),
guide = guide_legend(override.aes = list(size = 6))
)
The Variable Graph (VG) has a node for each variable and an edge between variables that occur together in at least one clause.
example_d_clauses = list(
k_1 = c(1L, 3L, 4L),
k_2 = c(1L, 5L, 6L),
k_3 = c(1L, 4L, 6L),
k_4 = c(2L, 5L, 6L),
k_5 = c(1L, 2L, 5L)
)
example_qubits = 6
var_graph = make_variable_graph(example_d_clauses, example_qubits)
ggnet2(var_graph)
Before modelling we need to ensure that each configuration of parameters has the same number of randomly generated samples. We will only take configurations that have at least \(n=10\) completed runs. We also add a boolean column, load, indicating whether that instance should be used in our analysis.
min_runs <- 10
# Find run configs with more than n=min_runs
d_configs <- d_enriched %>%
select_if(is.numeric) %>%
select(starts_with("params")) %>%
count(params_n_qubits, params_time_t, params_t_step) %>%
arrange(n) %>%
filter(n >= min_runs) %>%
mutate(load = T)
d_sampled <- d_enriched %>%
left_join(d_configs, by = c("params_n_qubits", "params_time_t", "params_t_step")) %>%
filter(load == T) %>%
group_by(params_n_qubits, params_time_t, params_t_step) %>%
sample_n(min_runs) %>%
select_if(is.numeric) %>%
select(-experiment_id, -n) %>%
ungroup()
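As a quick programmatic sanity check (a sketch only, using the d_sampled object built above), every parameter configuration should now contribute exactly min_runs rows:

```r
# Sanity check: after sampling, each configuration should have exactly
# min_runs (here 10) rows, so distinct(n) should return a single value.
d_sampled %>%
count(params_n_qubits, params_time_t, params_t_step) %>%
distinct(n)
```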
Before we start building a model it’s useful to have a look at which experiments have been run so far.
d_runs %>%
count(params_n_qubits, params_t_step, params_time_t) %>%
arrange(n) %>%
ggplot(aes(x = as.factor(params_n_qubits), y = n, fill = as.factor(params_time_t))) +
geom_col(aes(fill = as.factor(params_time_t)), position = position_dodge(width = 0.9)) +
geom_text(aes(label = n), position = position_dodge(width = 0.9), vjust = -0.7, size = 2.6) +
scale_fill_brewer(palette = "Blues", name = "Evolution Time") +
theme_light() +
facet_wrap(~params_t_step) +
theme(
legend.position = "bottom"
) +
labs(
x = "n qubits",
y = "Number of runs",
title = "Summary of Experiments Run"
)
Great, now when we sample our dataset, we see that each configuration has the same number of completed runs, in our case 10.
d_sampled %>%
count(params_n_qubits, params_t_step, params_time_t) %>%
arrange(n) %>%
ggplot(aes(x = as.factor(params_n_qubits), y = n, fill = as.factor(params_time_t))) +
geom_col(aes(fill = as.factor(params_time_t)), position = position_dodge(width = 0.9)) +
geom_text(aes(label = n), position = position_dodge(width = 0.9), vjust = -0.7, size = 2.6) +
scale_fill_brewer(palette = "Blues", name = "Evolution Time") +
theme_light() +
facet_wrap(~params_t_step) +
theme(
legend.position = "bottom"
) +
labs(
x = "n qubits",
y = "Number of runs",
title = "Summary of Experiments Run"
)
d_sampled %>%
group_by(params_n_qubits) %>%
summarise(
entanglement = mean(metrics_max_shannon_entropy, na.rm = T),
min_energy = mean(metrics_min_energy_gap, na.rm = T)
) %>%
gather(metric, value, -params_n_qubits) %>%
ggplot(aes(x = params_n_qubits, y = value, group = 1)) +
geom_line() +
facet_wrap(~metric, scales = "free", ncol = 1) +
theme_light() +
labs(
x = "N Qubits",
y = "Value"
)
We notice that for a smaller number of qubits there is almost no variance in the instances generated (as seen in the clause-to-variable ratio). We also see that our maximum Shannon entropy decreases as we run our evolution for longer.
d_sampled %>%
select(
metrics_clause_var_ratio,
metrics_min_energy_gap,
metrics_max_shannon_entropy,
params_n_qubits,
params_time_t
) %>%
mutate_all(as.numeric) %>%
mutate(
params_n_qubits = as.factor(params_n_qubits)
) %>%
ggplot(aes(x = metrics_clause_var_ratio, y = metrics_max_shannon_entropy, col = params_time_t)) +
geom_point(alpha = 0.8) +
facet_wrap(~params_n_qubits) +
theme_light() +
labs(
x = "Clause-Variable Ratio",
y = "Max Shannon Entropy",
title = "Shannon Entropy vs Clause-Var Ratio (by qubits)"
)
One of the key metrics we’re interested in is \(\mathbb{P}(\text{success})\). As the evolution run time of the algorithm increases, we observe a higher \(\mathbb{P}(\text{success})\). We also observe a higher variance.
d_sampled %>%
select(metrics_p_success, params_n_qubits, params_time_t, params_t_step) %>%
mutate(
params_time_t = as.numeric(params_time_t),
params_t_step = as.factor(params_t_step)
) %>%
ggplot(aes(x = params_time_t, y = metrics_p_success, col = params_t_step)) +
geom_point(alpha = 0.5) +
facet_wrap(~as.numeric(params_n_qubits)) +
labs(
y = "Probability of Success",
x = "Evolution Time"
) +
theme_light()
Here we clearly observe that a lower Shannon entropy corresponds to a higher probability of success, but this may be confounded by the algorithm run time, \(T\).
d_sampled %>%
select(metrics_p_success, metrics_max_shannon_entropy, params_time_t) %>%
ggplot(aes(x = metrics_p_success, y = metrics_max_shannon_entropy, col = params_time_t)) +
geom_point(alpha = 0.5) +
labs(
x = "Probability of Success",
y = "Shannon Entropy"
) +
theme_light()
d_sampled %>%
select(metrics_p_success, metrics_max_shannon_entropy, params_time_t, params_n_qubits) %>%
ggplot(aes(x = metrics_p_success, y = metrics_max_shannon_entropy, col = params_time_t)) +
geom_point(alpha = 0.6) +
facet_wrap(~params_n_qubits) +
labs(
x = "Probability of Success",
y = "Shannon Entropy",
title = "P(success) vs Entropy (by Qubits)"
) +
theme_light()
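One rough way to probe the confounding by \(T\) (a sketch only, using d_sampled from above) is to compute the entropy-success correlation within each evolution time separately:

```r
# Correlation between max Shannon entropy and P(success), computed
# within each evolution time to crudely control for T.
d_sampled %>%
group_by(params_time_t) %>%
summarise(
cor_entropy_success = cor(
metrics_max_shannon_entropy, metrics_p_success,
use = "complete.obs"
)
)
```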
d_sampled %>%
select(metrics_p_success, metrics_min_energy_gap, params_time_t) %>%
ggplot(aes(x = metrics_p_success, y = metrics_min_energy_gap, col = params_time_t)) +
geom_point(alpha = 0.6) +
theme_light() +
labs(
x = "Probability of Success",
y = "Min Energy Gap",
title = "P(success) vs Min Energy Gap"
)
d_sampled %>%
select(metrics_p_success, metrics_min_energy_gap, params_time_t, params_n_qubits) %>%
ggplot(aes(x = metrics_p_success, y = metrics_min_energy_gap, col = params_time_t)) +
geom_point(alpha = 0.8) +
facet_wrap(~params_n_qubits) +
theme_light() +
labs(
x = "Probability of Success",
y = "Min Energy Gap",
title = "P(Success) vs Energy Gap (by Qubit)"
)
d_sampled %>%
select(metrics_p_success, metrics_clause_var_ratio, params_n_qubits, params_time_t) %>%
ggplot(aes(x = metrics_clause_var_ratio, y = metrics_p_success, col = params_time_t)) +
geom_point() +
facet_wrap(~params_n_qubits) +
theme_light() +
labs(
x = "Clause-Variable Ratio",
y = "P(Success)"
)
We can now begin to model \(\mathbb{P}(\text{success})\); let’s start with a simple linear model.
First we need to remove columns with zero standard deviation, as well as those with missing values.
We can then check out a correlation matrix.
d_model <- d_sampled %>%
select_if(is.numeric) %>%
# Keep only the feature columns, plus the response
select(-starts_with(c("metric", "params")), metrics_p_success) %>%
# Drop rows with any missing values
filter_all(all_vars(!is.na(.))) %>%
# Drop zero-variance columns (Filter keeps columns whose sd is non-zero)
Filter(f = sd, .)
M <- cor(d_model)
corrplot(
M,
type = "upper",
order = "original",
col = brewer.pal(n = 8, name = "RdYlBu"),
tl.cex = 0.5,
tl.col = "blue"
)
Let’s look at the model summary:
d_model %>%
lm(data = ., formula = metrics_p_success ~ .) %>%
summary()
##
## Call:
## lm(formula = metrics_p_success ~ ., data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.45134 -0.16125 -0.02477 0.18398 0.53792
##
## Coefficients: (3 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.240e+06 1.022e+06 -1.213 0.22564
## f_p_size_n_clauses -3.616e-01 1.995e-01 -1.812 0.07056 .
## f_p_size_n_variables 1.993e-01 1.599e-01 1.247 0.21305
## f_p_size_ratio 1.173e+06 9.742e+05 1.204 0.22924
## f_p_size_ratio_sq -5.909e+05 4.945e+05 -1.195 0.23263
## f_p_size_ratio_cub 1.239e+05 1.044e+05 1.187 0.23578
## f_p_size_ratio_recp 7.357e+05 6.015e+05 1.223 0.22188
## f_p_size_ratio_recp_sq -2.325e+05 1.885e+05 -1.233 0.21800
## f_p_size_ratio_recp_cub 3.055e+04 2.455e+04 1.244 0.21404
## f_p_size_lin_ratio NA NA NA NA
## f_p_size_lin_ratio_sq NA NA NA NA
## f_p_size_lin_ratio_cb NA NA NA NA
## f_vcg_mean_var 1.318e+00 6.503e-01 2.027 0.04318 *
## f_vcg_median_var 5.799e-02 3.404e-02 1.703 0.08916 .
## f_vcg_min_var 2.720e-01 1.917e-01 1.419 0.15663
## f_vcg_max_var -3.946e-02 2.613e-02 -1.510 0.13162
## f_vg_mean -7.060e-02 8.772e-02 -0.805 0.42131
## f_vg_median 1.965e-02 3.672e-02 0.535 0.59289
## f_vg_min -2.946e-01 1.828e-01 -1.612 0.10768
## f_vg_max 7.702e-02 2.434e-02 3.164 0.00166 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2308 on 473 degrees of freedom
## Multiple R-squared: 0.2269, Adjusted R-squared: 0.2007
## F-statistic: 8.674 on 16 and 473 DF, p-value: < 2.2e-16
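The NA coefficients (the f_p_size_lin_ratio features) indicate perfectly collinear predictors. One possible cleanup, sketched here under the assumption that the caret package is available and with an arbitrary cutoff, is to drop highly correlated columns before refitting:

```r
library(caret) # assumed available; provides findCorrelation()

# Flag predictors whose pairwise correlation exceeds the cutoff
predictors <- d_model %>% select(-metrics_p_success)
to_drop <- findCorrelation(cor(predictors), cutoff = 0.95, names = TRUE)

# Refit the linear model on the reduced feature set
d_model %>%
select(-all_of(to_drop)) %>%
lm(data = ., formula = metrics_p_success ~ .) %>%
summary()
```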